Human pathogenic RNA viruses establish noncompeting lineages by occupying independent niches

Significance Numerous pathogenic viruses are endemic in humans and cause a broad variety of diseases, but what is their potential for causing new pandemics? We show that most human pathogenic RNA viruses form multiple, cocirculating lineages with low turnover rates. These lineages appear to be largely noncompeting and occupy distinct epidemiological niches that are not regionally or seasonally defined, and their persistence appears to stem from limited outbreaks in small communities so that only a small fraction of the global susceptible population is infected at any time. However, due to globalization, interaction and competition between lineages might increase, potentially leading to increased diversification and pathogenicity. Thus, endemic viruses appear to merit global attention with respect to the prevention of future pandemics.

: GISAID acknowledgements for the SARS-CoV-2 sequences used in this study. Table S4: Summary of manual sequence curation indicating lab-or vaccine-related keywords used to prune sequences. Table S5: Sequence IDs for EVA and H3N2 e10, e100, d10, and d100 subtrees.  Table S7: Sequence IDs for the H3N2 "new" and "old" subtrees. Table S8: Mutation rates (in units of nucleotide substitutions per site per year) and estimated dates for the last common ancestor of all tree/subtrees. Negative dates represent years prior to 0 CE.

23
Subalignments were considered for H3N2 and EVA, principally for the purpose of effective 24 population size analysis (see below). For each, subalignments were constructed to sample an 25 even number of isolates from every year at 10 and 100 isolates per year with three randomly 26 sampled replicates (6 alignments in total). Two additional alignments were constructed to 27 maximize the sequence diversity (based on Hamming distance, as for SARS-CoV-2) over the 28 same number of isolates as the first set of subalignments given 10 and 100 isolates per year 29 respectively.

30
In all cases sequences were harmonized to DNA (e.g. U was transformed to T to amend software 31 compatibility) and aligned with MAFFT(5), using default settings.

32
Sequences were clustered according to 100% identity with no coverage threshold using CD-HIT 33 (6), and otherwise default settings for MERS and H3N2. The longest sequence from each cluster 34 was selected as a representative. For all other viruses, all sequences were considered for the 35 next steps. Exterior ambiguous characters (preceding/succeeding the first/last defined nucleotide) 36 were removed, and sequences with more than 10 remaining interior, completely ambiguous 37 characters ("N") were discarded. MAFFT was run with default settings to align the cleared unique 38 sequences. Outliers based on hamming distance to the nearest neighbor and consensus were 39 identified and removed from the set. Remaining sequences were aligned with MAFFT, default 40 settings, together with the respective outgroup.

41
Note that, in principle, the redundancy introduced by this clustering step can introduce ambiguity 42 in mapping metadata on the tree. We show that this is not an issue for these datasets below.

43
Exterior ambiguous characters (preceding/succeeding the first/last defined nucleotide) were 44 removed. Sites corresponding to protein-coding ORFs were then mapped to the alignments from 45 the reference sequences excluding stop codons (see Supplementary Data). Noncoding regions 46 were discarded.
These resulting alignments contained out-of-frame gaps. We first identified gaps in the reference 48 sequences corresponding to multiples of three nucleotides present in no more than 90% of all 49 sequences (and thus likely corresponding to "true" insertions relative to the reference at the same 50 position in at least 10% of all sequences). Sites corresponding to less common insertions and 51 non-triplet gaps in the reference were discarded. The remaining gaps represent deletions relative 52 to the reference sequence. Similarly, codons in the frame of the reference sequences where 53 fewer than 10% of sequences in any of the three sites were represented by non-gap characters 54 were removed. Gaps shorter than three nucleotides in the remaining sites were replaced with the ambiguous character N. Longer gaps were shifted into frame and padded with ambiguous 56 characters on either end of the gap to minimize the number of characters changed. Sequences 57 composed of more than 25% gaps were then discarded.

58
Outliers among the remaining sequences were then identified based on the Hamming distance 59 (excluding gaps and ambiguous characters) to the nearest neighbor and removed 60 Unique alignment rows were then identified, and redundant rows were discarded. Such 61 redundancy, in addition to the redundancy introduced by prior clustering for MERS and H3N2, 62 can introduce ambiguity in mapping metadata on the tree as a single, random representative 63 isolate is selected to correspond to each unique row. As discussed above, the SARS-CoV-2 data 64 were subsampled and matched with representative metadata accordingly. Among the remaining 65 twenty-six alignments, only 7 had more than 10% of all isolates removed due to redundancy: 66 Ebola (37%), EVC (15%), HRSV (13%), MERS (16%), MMV (20%), MRV (25%), and H3N2 67 (42%). For EVC, all but 5 redundant sequences mapped to rows in the alignment which were 68 later pruned (see the following section on pruning lab-related isolates and select isolates from 69 non-human hosts). For the remaining 6 it was confirmed that the sequencing dates for the 70 redundant isolates correspond to those of the representative isolates mapped to the tree (see 71 Figure S30) and thus the selection of these representatives does not introduce significant

84
Selecting subalignments for EVA and H3N2

85
In order to analyze the effect of sampling efforts on the effective population size (see below), 86 several subalignments were generated for EVA and H3N2. To maintain an even representation 87 over time, up to either 10 or 100 isolates (not unique sequences) were randomly selected ("even"/ 88 "e"). To construct subalignments maximizing sequence diversity, we collected the same number 89 of sequences as each evenly sampled subalignment through an iterative process maximizing the 90 minimum hamming distance at each step ("diverse"/ "d"). The complete alignment was reduced to 91 the samples within each subset (keeping the outgroup) to retrieve eight final subalignments per 92 virus (3x e_10, 3x e_100 and 1x d_10 and d100). The corresponding sequence IDs for each 93 subalignment can be found in Supplementary Data, Table S5.

96
Dates and locations of isolation are available for many isolates reported as calendar dates and 97 city or country/administrative region of origin. The fraction of isolates for which metadata is 98 available varies across the alignments. Calendar dates may be reported as the day, month, or 99 only year. If the day or month was not specified, the collection date was assumed to be the 100 midpoint of the window. These dates are referenced as calendar dates in the main text and date 101 indices (number of days before/after January 1, 1950) in the supplement for convenience. In 102 practice, for the analysis presented in this work, only the year collected is relevant.

103
For the regional analysis, the latitude and longitude of each city of origin or a representative city 104 for each country/administrative region of origin was identified from simplemaps 105 (https://simplemaps.com/data/world-cities) (7). This information was used to calculate the 106 distance on the globe between pairs of sequencing locations (the great circle distance, not 107 respecting geographic or political boundaries).

Tree Construction
With the exception of SARS-CoV-2 and H3N2, tree topology was optimized using  with the evolutionary model fixed to GTR+F+G4 and the minimum branch length decreased from the default 10e-6 to 10e-7 (options: -m GTR+F+G4 -st DNA -blmin 0.0000001). For SARS-CoV-2, values; and using the previously-constructed maximum diversity subtree as a constraint 118 (compiled at double precision, options: -nt -gtr -gamma -cat 4 -nosupport -constraints).

119
Trees were rooted according to the position of an outgroup selected from representative 120 sequences of the closest relative species whenever the percent identity of the largest protein or 121 polyprotein was at least 30%. In three cases, the outgroup did not meet these criteria. We applied 122 midpoint tooting for HDV andHRSV) and rooted at the oldest sample for H3N2.

124
Manual Parsing of Trees 125 Viral lineages were manually selected representing viral sero-or genotypes, known geographic 126 subtypes, and large, monophyletic clades. No manual lineages were selected for EVB and

132
In addition to these manually established subtypes, we sought to identify individual "outbreaks" 133 within the trees with a well defined evolutionary "trajectory" composed of a monophyletic clade 134 such that within this clade, the distance to the tree root is well correlated with the date of isolation.

135
Moving from root to tip, at each node of the global topology, we calculated the Pearson 136 correlation coefficient between the root distance and the isolation date for all sequences 137 descendent from that node. For a predefined threshold, C, if the correlation was greater than C, 138 this clade was added to the ensemble of correlated clades and the search was stopped at that 139 node. If the correlation was below C, the search would continue until the descendent clades 140 contained fewer than 15 descendent sequences. Such clades with fewer than 15 sequences were 141 not considered further.

142
The threshold, C, was then determined for each tree to be either 0.8 or the largest value such that 143 30% of all sequences with known isolation dates were included in a correlated clade, whichever 144 was smaller with the exception of SARS-CoV-2. The threshold for SARS-CoV-2 was set to 0.75 145 as between 0.75 and 0.8 this parsing scheme dramatically differed from returning a single, 146 inclusive correlated clade to returning over 40 small clades. At 0.75, all but one sequence was 147 included, and consequently we considered all sequences in SARS-CoV-2 to belong to a single 148 correlated clade. C was found to be less than 0.2 for two trees (HDV/0.18 and RVA/0.16); less 149 than 0.65 for five more (Ebola/0.59, EVB/0.41, HCV/0.63, PeVA/0.39 and TBEV/0.49); and 0.8 for 150 all others (with the exception of SARS-CoV-2). H3N2 sequences display a very high root 151 distance/sequencing date correlation (Fig. S7) and nearly all H3N2 sequences formed a single 152 correlated clade even at a threshold value near 1 with the method described above; however, due 153 to the ladder-like tree structure with relatively fewer sequences near the root, parsing near the 154 root with this method is sensitive to outliers and rooting. Consequently, we considered all 155 sequences in H3N2 to belong to a single correlated clade. Note as discussed in the main text, we 156 refer to these correlated clades as "genealogical lineages" or GL.
additionally considered as well as two subtrees for H3N2 representing "newer" and "older" sequences (H3N2n/H3N2o) roughly determined to be sequenced after/before 2010 respectively.

164
All trees, global topologies for each alignment as well as the selected subtrees, with branch 165 lengths corresponding to the number of nucleotide substitutions per site, were paired with the 166 known dates of isolation to construct date-constrained, genealogical trees using least-square 167 dating (with software LSD2) (10) associated with a single mutation rate for each tree/subtree and, 168 consequently, a predicted date for the LCA of each tree/subtree (Supplementary Data, Table S8).

169
When not known, the isolation dates were estimated and unconstrained so that, in some trees, 170 select sequences are predicted to have been isolated in the future.

171
Many of these clades are far from the tree root with few sequences interspersed between the 172 clade root and the tree root. The mutation rate along these long, deep connecting branches may 173 differ significantly from that estimated over the whole tree or among the GLs. If present, such a 174 rate mismatch would substantially change the predicted date for the LCA. This date, as well as 175 the predicted dates of other deep nodes, is used to estimate the effective population size. We are 176 primarily interested in establishing a lower bound, but not necessarily an upper bound, for the 177 effective population size (see below) and given the sparsity of these branches, a partitioned 178 evolutionary model is unlikely to yield statistically significant results. Alternatively, we constructed 179 truncated global genealogical trees, or "grafted trees", as follows. First the deepest root (oldest 180 LCA) of any GL was identified. This was taken to be the root for the grafted tree. Next, the 181 remaining GLs were connected to this (now multifurcated) root by branches reflecting the 182 difference between the LCA of each clade and the root. These grafted trees represent an upper 183 bound on the mutation rates over these branches and the effective population size estimated over 184 these grafted trees represents the lower bound for the effective population size of each virus 185 respecting the genealogical trees estimated over each GL.

186
Given the original trees as well as the date-constrained, genealogical trees, we sought to 187 construct a measure representing the rate at which older GLs were displaced by more recent GLs 188 (for those trees with at least 2 GLs identified). This is related to the rate at which clades go 189 extinct. We considered the Shannon entropy of the clade distribution calculated over sliding

197
The entropy is reported in units of base N, the number of GLs in the tree so that the maximum 198 value is 1 and the minimum value is 0 for all trees. A mean entropy near 0 represents a 199 genealogical tree composed of clades which quickly displace one another. A mean entropy near 200 1 represents a genealogical tree where all clades are equally distributed at every time point. We 201 observe 〈 〉 > 〈 〉 in all but one case (HDV) indicating that the entropy is larger than that 202 expected from an analysis of the tree structure with no known dates of isolation (see Fig. 3D).

253
The effective population size Ne and the ratio of the census population size N over Ne was 254 estimated as previously described (13)

289
Simulating trees with respect to selection strength and sampling density

290
In order to demonstrate the potential equivalency between the impacts of selection strength and 291 sampling density on effective population size, we simulated an ensemble of trees as follows.

292
Beginning with a rooted, bifurcated tree consisting of 2 leaves with uniformly distributed random 293 branch lengths (between 0 and 1), the tree was extended by adding a multifurcated node at the 294 "end" of a leaf so that the terminal branch length became the branch leading to the added 295 multifurcation. The branch lengths of all leaves within the multifurcation were also random and 296 uniformly distributed between 0 and 1. This process was continued for 500 iterations. At each 297 iteration up to 10 leaves were selected for extension. To represent increased sampling, the 298 multifurcation multiplicity was varied from 2 to 6. To represent increased selection strength, 299 leaves were selected for extension only if they were positioned beyond a given threshold distance 300 to the root of the tree. This threshold was set to be the 0th, 25th, 50th, 75th, or 95th percentile of 301 the root distance distribution for the entire tree at that iteration. Note that in practice,

302
multifurcations were represented as bifurcating subtrees with arbitrary topology and arbitrarily 303 short interior branch lengths.

305
threshold is very high for viruses with ancient MRCAs leaving too few remaining sequences for more than 100k leaves could be generated but this becomes computationally impractical and is 329 unlikely to qualitatively affect the results presented).
select the 100 leaves with the greatest distance to the root. Simulated trees with more than 100 332 leaves above this threshold but fewer than the number of sequences in the real phylogeny were 333 pruned above the threshold. Simulated trees with more leaves above the threshold than 334 sequences in the real phylogeny were pruned above the threshold and then randomly sampled to 335 contain the same number of sequences. If the final, pruned, simulated tree contained fewer 336 sequences than the real phylogeny, the real phylogeny was then subsampled to maximize  was computed ( Figure S41).

366
Metadata all sequences is available in Supplementary Data, Table S11.  Table S1: Virus characteristics including abbreviation, family, order, tax id, sequence length 460 threshold (for nearly complete genomes), vaccination status, mode of transmission, circulation 461 (human or human/zoonotic), available treatment, and disease progression (acute or chronic).  Table S3: GISAID acknowledgements for the SARS-CoV-2 sequences used in this study. Table S4: Summary of manual sequence curation indicating lab-or vaccine-related keywords 465 used to prune sequences. Table S5: Sequence IDs for EVA and H3N2 e10, e100, d10, and d100 subtrees.   Table S7: Sequence IDs for the H3N2 "new" and "old" subtrees.

507
Open circles represent leaves which are not included in any correlate clade.

517
Open circles represent leaves which are not included in any correlate clade.

527
Open circles represent leaves which are not included in any correlate clade.